DOCUMENT RESUME 



ED 081 205 



EM 011 376



AUTHOR        Utter, Merlin; Wilkinson, John
TITLE         Some Classroom Experiences in the Teaching of
              Empirical Model Building and Regression Analysis.
PUB DATE      Jun 73
NOTE          3p.; Paper presented at the Conference on Computers
              in the Undergraduate Curricula (Claremont,
              California, June 18-20, 1973)

EDRS PRICE    MF-$0.65 HC-$3.29
DESCRIPTORS   Calculation; College Mathematics; *Computer Assisted
              Instruction; Computer Programs; Digital Computers;
              Higher Education; *Mathematical Models; *Mathematics
              Instruction; *Multiple Regression Analysis; Program
              Descriptions; *Statistics; Time Sharing;
              Undergraduate Study
IDENTIFIERS   BMD; CAI; LINREG; RPIREG; STEPREG



ABSTRACT 

The use of the digital computer for the presentation 
of the topics of empirical model building and regression analysis is 
discussed. The author concentrates upon a description of computing 
exercises which are employed to provide the students with experience 
in model building and evaluation in a controlled situation. The types
of exercises given are treated, followed by a discussion of the
relative merits and dysfunctional aspects of the time-sharing and
batch modes of operation. Details are presented concerning the main
programs accessed by the students: the BMD multiple and stepwise
regression programs, RPIREG, STEPREG, and LINREG. Finally, there is
consideration of the strengths and weaknesses of the computer-assisted
instructional (CAI) approach to these topics. (PB)



ERIC



FILMED FROM BEST AVAILABLE COPY 



SOME CLASSROOM EXPERIENCES IN THE TEACHING OF EMPIRICAL MODEL BUILDING AND REGRESSION ANALYSIS



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM THE PERSON
OR ORGANIZATION ORIGINATING IT.
POINTS OF VIEW OR OPINIONS STATED
DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



Merlin Utter and John W. Wilkinson*
Operations Research and Statistics
Rensselaer Polytechnic Institute
Troy, New York
(518) 270-6565



Without the computer the presentation of a topic such as empirical model building and
regression analysis would be quite sterile. Hence, in a course involving a heterogeneous
mixture of undergraduate and graduate students, involvement of the digital computer has been
invaluable. In presenting the material, the following three vehicles have helped to involve
the student in the subject: (a) problem sets, which provide a familiarization with the
available computing power as well as instruction in the topics of regression analysis; (b)
projects, which introduce the student to the importance and difficulties of problem
formulation as well as the problems encountered with the gathering and handling of real data;
and (c) computing exercises, which provide an experience in model building and evaluation in
a controlled situation. This paper will concentrate on the computing exercises portion of
the course.

For the computer exercises, the students are provided with sets of data which have been
artificially generated. Rather than fit a specific model to the data, as is commonly done in
problem sets, the students try to estimate the regression model from which the data were
generated. In the computing exercises, unlike the projects, the true model from which the
data were generated is known, and it is felt that this information can eventually be used by
the student to compare his model with the true situation. This
would allow feedback that might provide insight into what moves were important in reaching
either the right or wrong conclusions in the modeling process. The data are generated by the
instructor in a simple format. Once the model is decided upon, the response or dependent
variable is generated without error as a known function of one or more independent or
predictor factors. These independent factors are either chosen randomly over some
preassigned factor space or according to some designed experiment. Once the error of some
prescribed form has been generated, addition to the previously obtained errorless response
value provides the "observed" value for the dependent variable. These corresponding values
of the predictor factors and response variables are generated by the computer, requiring as
input only the model to be used, the form of error desired, and the number and location of
the observations to be taken.
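
The generation scheme just described can be sketched in a few lines. What follows is a minimal illustration in Python with NumPy, not the instructor's actual program: the function name generate_exercise, the example model, the factor range and the choice of normal error are all assumptions made for the sketch.

```python
import numpy as np

def generate_exercise(true_model, x, error_sd, seed=0):
    """Return (x, y_observed, y_errorless) for one exercise data set.

    true_model : function giving the errorless response at the points x
    x          : locations of the observations (random or from a design)
    error_sd   : standard deviation of an additive normal error (assumed form;
                 the paper only speaks of error "of some prescribed form")
    """
    rng = np.random.default_rng(seed)
    y_errorless = true_model(x)                       # response without error
    error = rng.normal(0.0, error_sd, size=x.shape)   # error of the chosen form
    return x, y_errorless + error, y_errorless        # observed = errorless + error

# Hypothetical true model: a second-order polynomial in one factor.
def quadratic(x):
    return 2.0 + 1.5 * x - 0.4 * x ** 2

rng = np.random.default_rng(1)
x_random = rng.uniform(0.0, 10.0, size=12)            # factors chosen at random
x_design = np.repeat(np.linspace(0.0, 10.0, 3), 4)    # 4 replicates at 3 points

x, y_obs, y_true = generate_exercise(quadratic, x_design, error_sd=1.0)
for xi, yi in zip(x, y_obs):
    print(f"{xi:6.2f} {yi:8.3f}")
```

Only the model, the error form, and the number and location of the observations are supplied, which mirrors the inputs listed in the text.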

The error chosen can be of such a magnitude that it is either very difficult or much too
easy to adequately fit a model to the generated data. For instance, the first set of data
given to the student was generated from a low-order polynomial with extremely little error.
The result was an unchallenging and most uninteresting rapid convergence to the correct model
for the student. On the other hand, students should not be provided with data involving such
a large random error that they are lulled into thinking that nothing but an (n-1)st order
polynomial can provide an adequate fit. After experimentation with the magnitude of error
relative to the range of the response variable, appropriate values were found for the
students' model building and evaluation experiments.
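
One simple way to express "magnitude of error relative to the range of the response variable" is to set the error standard deviation to a chosen fraction of the errorless response range. The sketch below assumes that convention; the helper name error_sd_from_range and the fractions tried are hypothetical, since the paper does not report the values finally used.

```python
import numpy as np

def error_sd_from_range(y_errorless, fraction):
    """Error standard deviation as a fraction of the errorless response range."""
    return fraction * (np.max(y_errorless) - np.min(y_errorless))

x = np.linspace(0.0, 10.0, 20)
y_errorless = 2.0 + 1.5 * x - 0.4 * x ** 2     # hypothetical true model

for fraction in (0.01, 0.10, 0.50):            # illustrative values only
    sd = error_sd_from_range(y_errorless, fraction)
    print(f"fraction {fraction:4.2f} of range -> error sd {sd:7.3f}")
```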

Various types of exercises are given to the students. The first exercise introduces
them to the techniques involved in fitting a simple polynomial of either one or two variables
of second order or less, usually missing one of the terms, such as the cross-product
term. Subsequent data sets are generated from models involving more complicated functions of
the independent variables, such as √x, sin(x) and 1/x. Models involving transformations on
the dependent variable, such as 1/y, √y and ln y, have added spice to the model building game.
Data generated from these latter types of models have created valuable learning experiences,
due to the strange behavior of the residuals and other statistics obtained when fitting the
wrong model. Also it has been interesting to occasionally add a factor which is merely
random noise and actually has no influence on the true model. This is important because the
student would soon find out if one always presented significant factors and this would
considerably influence his model building. In other exercises, even though data have been
generated from a model with two factors, the students are provided only one of the
independent factors along with the dependent variable generated for the complete model. Such
an exercise has provided an excellent introduction to the effects of missing variables as
well as an awareness of the possible need to search for additional explanations of a
dependent response. In the exercise, the students were initially perplexed when they
obtained highly significant parameters and regression sum of squares but unusual residual
plots. To complete the exercise, the "lost variable" was provided to give the students the
opportunity to re-evaluate and modify their model based on this new and more complete
information.
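
The missing-variable exercise can be imitated with a small simulation. The sketch below is an illustration under assumed values (a hypothetical two-factor linear model, uniform factors, normal error), not a reconstruction of the data sets used in the course. It fits the response first on one factor and then on both, and checks how strongly the one-factor residuals follow the withheld factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x1 = rng.uniform(0.0, 10.0, n)
x2 = rng.uniform(0.0, 10.0, n)                  # the "lost variable"
y = 5.0 + 3.0 * x1 + 2.0 * x2 + rng.normal(0.0, 1.0, n)   # hypothetical model

def fit(columns, y):
    """Least-squares fit with an intercept; returns coefficients and residuals."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

_, resid_one = fit([x1], y)          # what the students see at first
_, resid_both = fit([x1, x2], y)     # after the lost variable is supplied

# Residuals from the one-factor fit still track the withheld factor.
corr_missing = np.corrcoef(resid_one, x2)[0, 1]
ss_one = float(resid_one @ resid_one)
ss_both = float(resid_both @ resid_both)
print("corr(residuals, lost variable):", round(corr_missing, 3))
print("residual SS, one factor :", round(ss_one, 1))
print("residual SS, two factors:", round(ss_both, 1))
```

Plotting resid_one against x2 would show the kind of unusual residual pattern the students met before the second factor was supplied.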

Some exercises have dealt with different sets of data generated for the same model, but
under various designs, thus providing a comparison of their respective powers for evaluating
"goodness of fit." Such examples have been: (a) cyclic and factorial designs to fit a
second-order, two-variable polynomial; (b) designs with n/3 replicates at each of three
equally spaced points, n/6 replicates at each of six equally spaced points and n equally
spaced points to fit a second-order, one-variable polynomial; and (c) designs with n/6
replicates at each of six equally spaced points and n equally spaced points to fit a
third-order, one-variable polynomial. From exercises like these, there are often some side
benefits that make significant contributions to the learning process. For instance, when
fitting a second-order polynomial to data from a third-order, one-factor polynomial, a higher
R² value (R representing the multiple correlation coefficient) was observed than when the
correct model was fit to data generated from a second-order polynomial (obtained by
eliminating the cubic term from the model above). This apparent anomaly is due to the larger
total sum of squares involved in the first situation. However, it provided a very sobering
message about judging the goodness of a model from the magnitude of R².
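
The R² anomaly is easy to reproduce. The following sketch uses assumed coefficients, an assumed error standard deviation and the same error vector for both data sets; none of these values come from the paper. Because R² is one minus the ratio of residual to total sum of squares, the much larger total sum of squares of the cubic data can outweigh the lack of fit of the wrong model.

```python
import numpy as np

def r_squared(y, fitted):
    """R^2 = 1 - residual SS / total SS about the mean."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_quadratic(x, y):
    """Least-squares second-order polynomial; returns the fitted values."""
    coeffs = np.polyfit(x, y, deg=2)
    return np.polyval(coeffs, x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
error = rng.normal(0.0, 5.0, x.size)             # same error for both data sets

y_quadratic = 1.0 + 2.0 * x + 0.5 * x ** 2 + error              # hypothetical model
y_cubic = 1.0 + 2.0 * x + 0.5 * x ** 2 + 0.3 * x ** 3 + error   # same, plus cubic term

r2_wrong = r_squared(y_cubic, fit_quadratic(x, y_cubic))            # wrong model
r2_right = r_squared(y_quadratic, fit_quadratic(x, y_quadratic))    # correct model
print(f"quadratic fit to cubic data:     R^2 = {r2_wrong:.4f}")
print(f"quadratic fit to quadratic data: R^2 = {r2_right:.4f}")
```

With these assumed values the wrong model should show the higher R², even though its residuals contain a systematic cubic pattern that a residual plot would reveal.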

These computer exercises could be carried out in either time-sharing or batch modes of
operation. The main programs accessed by the students were the BMD multiple and stepwise
regression programs, RPIREG, STEPREG and LINREG, the latter being specially written with the
computing exercises in mind.

All the programs provide the standard correlation matrix, variance-covariance matrix,
parameter estimates with their associated standard deviations and t-statistic values, ANOVA
table, R² value and various printer plots of residuals. The stepwise programs also provide
the partial correlations with the response variable of those factors not yet in the
regression. In addition, LINREG allows inequality constraints on the parameters as well as
the ability to test hypotheses of linear combinations of the parameters. Another side effect
of the computer exercises has been their indirect effect on the refinement of applicable
computer programs.
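
For readers without access to these programs, the core of the printed output can be approximated as follows. This is a sketch of the standard least-squares formulas in NumPy, not the code of BMD, RPIREG, STEPREG or LINREG; the function name regression_summary and the example data, including the pure-noise factor, are assumptions.

```python
import numpy as np

def regression_summary(X, y):
    """Estimates, standard deviations, t-values and R^2 for y = X b + error.

    X must already contain a column of ones if an intercept is wanted.
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                    # parameter estimates
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)                # residual mean square
    cov_beta = s2 * xtx_inv                     # covariance matrix of the estimates
    std_dev = np.sqrt(np.diag(cov_beta))        # standard deviations of estimates
    t_values = beta / std_dev                   # t-statistics for H0: b_j = 0
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, std_dev, t_values, r2

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 25)
noise_factor = rng.uniform(0.0, 10.0, x.size)      # factor with no real influence
y = 4.0 + 2.0 * x + rng.normal(0.0, 1.0, x.size)   # hypothetical true model

X = np.column_stack([np.ones_like(x), x, noise_factor])
beta, std_dev, t_values, r2 = regression_summary(X, y)
for name, b, s, t in zip(("constant", "x", "noise factor"), beta, std_dev, t_values):
    print(f"{name:12s} estimate {b:8.3f}  std dev {s:7.3f}  t {t:8.2f}")
print(f"R^2 = {r2:.4f}")
```

The explicit matrix inverse keeps the formulas visible; for ill-conditioned factor sets a solver such as np.linalg.lstsq would be the safer choice.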

At the outset, it was thought that these computer exercises would best be done in the
time-sharing mode and thereby fully utilize the benefits of such an interactive system, where
model after model could be sequentially run in a logical fashion, leading toward a "good"
model. In fact, the LINREG program allows one to choose each variable to be entered or
deleted in the stepwise procedure manually in a true interactive manner. However,
time-sharing is not crucial for this type of work and its use was not insisted upon. The result
has been that this mode has not been used to the extent expected, for many
reasons, some involving program sophistication and others related to computer system
utilization. Because of an overloaded computer system, elapsed time at the remote terminal
has been too long for the amount of actual computing performed. This has been the main
reason for students "giving up" on the time-sharing mode and going to a batch mode of
operation. Another contributing factor has been the necessity during the day to sign up for
terminal use, requiring the student to adapt his schedule to terminal availability. Also, a
computer system change early in the course resulted in a decline in the reliability of the
time-sharing mode. Often the student would experience system crashes essentially requiring
him to start over again. Also, when the system was working well, there was the temptation
(often taken) to use the "shotgun" approach and to try as many models as possible without
much thought other than to run as many as possible in the time the student had been assigned
to the terminal. This tended to defeat any advantage that the interactive "instant
turnaround" time-sharing mode was supposed to offer.

To cut down on the load added to the system and to reduce the computer costs for the 
course, groups of 2 to 4 students were formed to work jointly on the computer exercises
rather than each individual doing each exercise independently. Although this approach posed 
the danger of potentially allowing some students to coast, it had the advantage of 
encouraging interaction with each other which aided the model building process. Each group 
received a different set of data, often from different models. This encouraged independent 
work and also provided the class with a variety of experiences for later class discussion. 

The use of the batch mode also had some problems. At certain times during the semester,
the turnaround was quite slow, again pressuring the student to consider the "shotgun"
approach to run many models with the hopes that if you try enough you might be lucky and pick
a winner. To help avoid this, the students were given longer to complete the exercises so
they would feel no real time constraint. In addition, the form of the written report
required for each computer exercise was altered. At first, very few instructions were
provided concerning the form that the report should take or the technique used to obtain the
model that the group felt best fit the data. As might have been expected, the result was a
barrage of computer printouts, the output from all the models each group considered worth
running. Besides requiring unnecessary amounts of computer time, this shotgun approach
resulted in a minimal gain in knowledge of model building. Many students were content to let
the stepwise program do the work, and merely fit that polynomial which best fit the data,
regardless of the true model. To discourage these practices, a step-by-step procedure to
obtain the resulting model was required. It was stressed that the students use the data and
any previously run models to determine the next model to be tried. Each step of this logical
procedure involving how they decided the next model to try, as well as the pertinent
information derived from each model attempted, was to be documented in the write-up. As part
of the analysis the students were to discuss the goodness of the fit, the precision of the
estimates, the examination of residuals both by graphical procedures and by the use of
various statistical tests, and any possible signs that the error was not completely random.
The results were clearly better.
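
The paper does not name the statistical tests applied to the residuals. As one possible illustration, the sketch below computes two common checks for non-random error when the residuals are ordered by the predictor: the Durbin-Watson statistic and the number of sign runs. The choice of these two checks, the helper names and the example model are assumptions.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: near 2 for uncorrelated residuals,
    well below 2 when successive residuals are positively correlated."""
    diff = np.diff(resid)
    return float(diff @ diff / (resid @ resid))

def count_sign_runs(resid):
    """Number of runs of consecutive residuals with the same sign."""
    signs = np.sign(resid)
    signs = signs[signs != 0]                     # ignore exact zeros
    return int(1 + np.sum(signs[1:] != signs[:-1]))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)                    # residuals are ordered by x
y = 1.0 + 2.0 * x + 0.5 * x ** 2 + rng.normal(0.0, 1.0, x.size)  # hypothetical

for degree in (1, 2):                             # wrong model, then correct one
    resid = y - np.polyval(np.polyfit(x, y, degree), x)
    print(f"degree {degree}: DW = {durbin_watson(resid):.2f}, "
          f"sign runs = {count_sign_runs(resid)}")
```

A straight-line fit to curved data leaves long runs of same-signed residuals and a small Durbin-Watson value, which is the kind of sign that the error is not completely random.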

Although the results of our computer exercises have been extremely valuable in
presenting the concepts of model building and impressing the students with the effect of
factor space coverage on this process, there are many improvements to be made. Students
still run many more models than they need, and seem willing to substitute a little more
keypunching of new models for a little less thinking and inspection of the results already
obtained. Because of the format of the write-up, some students report on only those models
which appear good. One possible solution is to monitor the amount of computer time used by
each group, and use this time as a measure of their efficiency in the modeling process. In
the past, only a typed copy of the data has been provided, forcing the students to type in
the data themselves each time the terminal is to be used. To alleviate this situation, and
to permit more variety in sample size, it is planned to store the data on files in the
computer for easy access by the student. Also planned is the development of other types of
exercises which, among other things, will allow experimentation examining the violation of
the various assumptions relating to such features as common variance and additive error.
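
Data sets for such planned exercises might be generated along the following lines. This is only a sketch of two possible violations, unequal variance and multiplicative error. The paper gives no details of the planned exercises, so the model, the error forms and all numerical values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 40)
y_errorless = 3.0 + 2.0 * x                      # hypothetical true model

# Baseline: additive error with common variance.
y_additive = y_errorless + rng.normal(0.0, 1.0, x.size)

# Violation 1: additive error whose standard deviation grows with x.
y_unequal_var = y_errorless + rng.normal(0.0, 0.3 * x)

# Violation 2: multiplicative (lognormal) error instead of additive error.
y_multiplicative = y_errorless * np.exp(rng.normal(0.0, 0.1, x.size))
```

Fitting the same straight line to each set and comparing residual plots would let students see how each violation shows up.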



*John W. Wilkinson will handle correspondence.






